Skip to content

fix: unblock Archon when launched from a Claude Code terminal and on Bun 1.3+#1097

Closed
samuelcombey wants to merge 1 commit intocoleam00:devfrom
samuelcombey:fix/claude-sdk-env-and-telegraf-bun-compat
Closed

fix: unblock Archon when launched from a Claude Code terminal and on Bun 1.3+#1097
samuelcombey wants to merge 1 commit intocoleam00:devfrom
samuelcombey:fix/claude-sdk-env-and-telegraf-bun-compat

Conversation

@samuelcombey
Copy link
Copy Markdown

@samuelcombey samuelcombey commented Apr 11, 2026

Summary

Two blockers hit during a fresh setup on macOS, from inside a Claude Code terminal on Bun 1.3.12 — both reproduce deterministically and leave the orchestrator silently hung with no actionable error.

1. Claude Agent SDK subprocess refuses to launch (nested-session guard)

The parent Claude Code shell exports these markers:

  • `CLAUDECODE=1`
  • `CLAUDE_CODE_ENTRYPOINT`
  • `CLAUDE_CODE_EXECPATH`
  • `CLAUDE_CODE_HIDE_ACCOUNT_INFO`
  • `CLAUDE_CODE_NO_FLICKER`
  • `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS`

Archon's existing cleanup in `buildSubprocessEnv()` (`packages/core/src/clients/claude.ts:167`) only stripped `CLAUDECODE`. Worse, the Claude Agent SDK leaks `process.env` into the spawned child regardless of the explicit `env` option passed to `query()` — so even vars that `SUBPROCESS_ENV_ALLOWLIST` correctly excludes still reach the subprocess. The child then hits its own nested-session guard and the orchestrator hangs after logging `using_global_auth` twice, with no `query_failed`, no `subprocess_error`, no outbound connection to api.anthropic.com, and no visible child process.

The only bulletproof fix until the SDK stops leaking `process.env` is to delete the markers at the process level at server and CLI entry points. Auth vars (`CLAUDE_CODE_OAUTH_TOKEN`, `CLAUDE_CODE_USE_BEDROCK`, `CLAUDE_CODE_USE_VERTEX`) are kept — only the six nested-session markers are removed.

2. Telegram adapter dies on first polling error on Bun 1.3+

Telegraf's `redactToken()` (`node_modules/telegraf/lib/core/network/client.js:248`) does:

```js
error.message = error.message.replace(//(bot|user)(\d+):[^/]+//, '/$1$2:[REDACTED]/');
```

On Bun 1.3+, `Error.message` on certain error instances (fetch/network errors) is non-writable, so the assignment throws `TypeError: Attempted to assign to readonly property.`, kills the polling loop, and leaves the adapter silently dead. The single log line is:

```
telegram.start_failed TypeError: Attempted to assign to readonly property.
at redactToken (telegraf/lib/core/network/client.js:248:5)
```

Patched via `bun patch telegraf` to wrap the assignment in `try/catch`. The original error is still re-thrown unchanged, so Telegraf's own retry/reconnect logic runs normally; the only difference is that we survive one specific Bun runtime quirk.

3. Incidental: `bun.lock` re-sync

`bun patch --commit` re-serialized `bun.lock` and picked up workspace version bumps that were already in the individual `package.json` files (`0.3.5`) but stale in the lockfile (`0.1.0`/`0.2.0`). No behavioral change — just brings the lockfile in sync with reality.

Files

  • `packages/server/src/index.ts` — strip six markers right after dotenv loads, before any application imports
  • `packages/cli/src/cli.ts` — same strip at CLI entry
  • `patches/telegraf@4.16.3.patch` — `try/catch` around the `redactToken` assignment
  • `package.json` — `patchedDependencies` entry for the telegraf patch
  • `bun.lock` — pick up `patchedDependencies` + workspace version re-sync

Reproduction

From inside a Claude Code terminal on Bun 1.3+:

```bash

Before this PR

archon chat "say hi"

Hangs forever after "using_global_auth". No error.

After this PR

archon chat "say hi"

"Hello!" (or similar)

```

And for the Telegram bug: configure a bot token, start the server, run `curl https://api.telegram.org/bot/getUpdates` to trigger a 409 conflict in the long-poll — before this PR the adapter logs `telegram.start_failed` and stops polling forever; after, it logs the upstream error and reconnects.

Test plan

  • `bun run validate` passes (type-check, lint, format, tests)
  • `archon chat` works end-to-end
  • Telegram bot replies to DMs
  • Web UI chat replies
  • Verified CI passes on this branch

Related latent issues (not fixed in this PR)

  1. `~/.archon/workspaces` is the orchestrator's fallback cwd for new conversations without a registered codebase (`orchestrator-agent.ts:743`). If the dir doesn't exist, Bun's `spawn()` hangs silently instead of erroring — another way to reproduce the exact same symptoms as Model stucked at response stream text #1 above. A follow-up PR should `mkdirSync(getArchonWorkspacesPath(), { recursive: true })` during server/CLI startup.
  2. The `process.env` leak in the Claude Agent SDK (Model stucked at response stream text #1 root cause) should be reported upstream to `@anthropic-ai/claude-agent-sdk`.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed error message handling in network operations to work across different runtime environments.
  • Chores

    • Enhanced environment variable cleanup during CLI and server startup to prevent context leakage to spawned subprocesses.
    • Applied dependency patch to improve stability.

…Bun 1.3+

Two blockers hit during a fresh setup from inside a Claude Code session on
Bun 1.3.12; both reproduce deterministically.

1) Claude Agent SDK subprocess refuses to launch (nested-session guard).
   The parent Claude Code shell exports CLAUDECODE, CLAUDE_CODE_ENTRYPOINT,
   CLAUDE_CODE_EXECPATH, CLAUDE_CODE_HIDE_ACCOUNT_INFO, CLAUDE_CODE_NO_FLICKER,
   and CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS. Archon's existing cleanup in
   buildSubprocessEnv() only stripped CLAUDECODE, and the SDK leaks process.env
   into the spawned child regardless of the explicit `env` option passed to
   query() — so the subprocess sees the remaining markers, hits its own
   nested-session guard, and the orchestrator hangs after `using_global_auth`
   with no error. Strip all six markers from process.env at server and CLI
   entry points so nothing can leak. Auth vars (OAUTH_TOKEN, USE_BEDROCK,
   USE_VERTEX) are kept.

2) Telegram adapter dies on first polling error on Bun 1.3+.
   Telegraf's redactToken() does `error.message = error.message.replace(...)`
   to sanitize bot tokens from error stacks. On Bun 1.3+, Error.message on
   fetch/network error instances is non-writable, so the assignment throws
   TypeError, kills the polling loop, and leaves the adapter silently dead
   with `telegram.start_failed`. Patch telegraf@4.16.3 via `bun patch` to
   wrap the assignment in try/catch; the original error is still re-thrown
   so Telegraf's own retry/reconnect logic runs normally.

Also picks up an incidental bun.lock re-sync from `bun patch --commit`: the
workspace package versions in the lockfile were stale (0.1.0/0.2.0) while
the actual package.json files have been 0.3.5 for a while.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f7eb5279-65aa-4218-9828-390bb246d6f8

📥 Commits

Reviewing files that changed from the base of the PR and between 536584d and f41f4e8.

⛔ Files ignored due to path filters (1)
  • bun.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • package.json
  • packages/cli/src/cli.ts
  • packages/server/src/index.ts
  • patches/telegraf@4.16.3.patch

📝 Walkthrough

Walkthrough

This PR adds environment variable cleanup at startup to prevent nested-session context detection in spawned subprocesses, removes Claude Code markers (CLAUDECODE and related variables) from process.env in CLI and server entry points, and patches telegraf@4.16.3 to safely handle non-writable error messages in Bun environments.

Changes

Cohort / File(s) Summary
Environment Variable Cleanup
packages/cli/src/cli.ts, packages/server/src/index.ts
Deletes Claude Code nested-session environment markers (CLAUDECODE, CLAUDE_CODE_ENTRYPOINT, CLAUDE_CODE_EXECPATH, CLAUDE_CODE_HIDE_ACCOUNT_INFO, CLAUDE_CODE_NO_FLICKER, CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS) from process.env at startup; preserves auth/config variables like CLAUDE_USE_GLOBAL_AUTH.
Telegraf Patch Setup
package.json
Adds patchedDependencies entry to apply patches/telegraf@4.16.3.patch to telegraf@4.16.3.
Telegraf Token Redaction Robustness
patches/telegraf@4.16.3.patch
Wraps error.message mutation in redactToken() with try/catch to handle non-writable error messages in Bun runtime; suppresses failure and rethrows original error if assignment fails.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

Possibly related PRs

Poem

🐰 Away with the nested markers go,
CLAUDECODE and friends—out they flow!
Telegraf now safely redacts with care,
Through Bun's strict world without a snare.
try/catch in patches, env vars cleaned—
Smoother spawning than ever has been! 🌿

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: unblock Archon when launched from a Claude Code terminal and on Bun 1.3+' accurately summarizes the primary fixes: removing environment variable markers and patching Telegraf for Bun 1.3+ compatibility.
Description check ✅ Passed The PR description is comprehensive and detailed, covering the two main blockers, root causes, fixes applied, reproduction steps, and test validation. It aligns well with template expectations despite not following the exact template sections.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f41f4e8917

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

+ // original error rather than killing the polling loop.
+ try {
+ error.message = error.message.replace(/\/(bot|user)(\d+):[^/]+\//, '/$1$2:[REDACTED]/');
+ } catch (_) { /* message is read-only; re-throw original */ }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve token redaction on readonly message errors

Catching and ignoring failures in redactToken means Bun readonly-message errors are rethrown with the original unredacted text, which can include /bot<id>:<token>/ and leak the Telegram bot token into logs. In this exact Bun 1.3+ path, the new behavior fixes retries but drops the security guarantee this function exists to provide; please keep retry behavior while still throwing a redacted error payload.

Useful? React with 👍 / 👎.

Wirasm added a commit that referenced this pull request Apr 12, 2026
…marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate #1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per #1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).
Wirasm added a commit that referenced this pull request Apr 12, 2026
…only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see #1097)
- It attracted a constant stream of PRs adding keys (#1060, #1093, #1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.
@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented Apr 12, 2026

Thanks for the excellent debugging on the nested Claude Code session issue — your finding that the SDK leaks process.env regardless of the env option was the key insight we were missing.

We've integrated your CLAUDECODE + CLAUDE_CODE_* marker stripping into #1092, with one difference: we pattern-match on CLAUDE_CODE_* (minus auth vars) rather than hardcoding 6 specific names, so future markers are handled automatically. The stripping runs inside stripCwdEnv() at entry point, covering both CLI and server.

The Telegraf Bun 1.3+ fix in this PR is still needed#1092 doesn't address that. Would you be willing to resubmit a PR with just the Telegraf bun patch fix? Note that Cole also has #1066 open which replaces Telegraf with grammY entirely — you might want to coordinate with him on which approach lands.

Summary:

Wirasm added a commit that referenced this pull request Apr 12, 2026
…t timeout (#1067, #1030, #1098, #1070)

* fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout (#1067)

Three bugs fixed: (1) Bun auto-loads CWD .env files before user code, leaking
non-overlapping keys into the Archon process — new stripCwdEnv() boot import
removes them before any module reads env. (2) archon serve hardcoded
skipPlatformAdapters:true, preventing Slack/Telegram/Discord from starting.
(3) Claude SDK query had no first-event timeout, causing silent 30-min hangs
when the subprocess wedges — new withFirstMessageTimeout wrapper races the
first event against a configurable deadline (default 60s).

Changes:
- Add @archon/paths/strip-cwd-env and strip-cwd-env-boot modules
- Import boot module as first import in CLI entry point
- Remove skipPlatformAdapters: true from serve.ts
- Add withFirstMessageTimeout + diagnostics to ClaudeClient
- Add CLAUDECODE=1 nested-session warning to CLI
- Add 9 unit tests (6 strip-cwd-env + 3 timeout)

Fixes #1067

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings for PR #1092

Fixed:
- Clear setTimeout timer in withFirstMessageTimeout finally block (HIGH-1)
- Add strip-cwd-env-boot to server/src/index.ts for direct dev:server path (MEDIUM-1)
- Warn to stderr on non-ENOENT errors in stripCwdEnv (MEDIUM-2)
- Update stale configuration.md docs for new env-loading mechanism (HIGH-2)
- Add ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING env vars to docs
- Add nested Claude Code hang troubleshooting entry
- Fix boot module JSDoc: "CLI and server" → "CLI" only
- Fix stripCwdEnv JSDoc: remove stale "override: true" reference
- Update .claude/rules/cli.md startup behavior section
- Update CLAUDE.md @archon/paths description with new exports

Tests added:
- Assert controller.signal.aborted on timeout
- Handle generator that completes immediately without yielding
- Strip distinct keys from different .env files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* simplify: replace string sentinel with typed error class in withFirstMessageTimeout

Replace the '__timeout__' string sentinel used to identify timeout rejections
with a dedicated FirstEventTimeoutError class. instanceof checks are more
explicit and robust than string comparison on error messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review findings — dotenv version, docs, server warning, marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate #1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per #1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).

* refactor: remove SUBPROCESS_ENV_ALLOWLIST — trust user config, strip only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see #1097)
- It attracted a constant stream of PRs adding keys (#1060, #1093, #1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.

* fix: restore override:true for archon env, add integration tests

The integration tests caught a real issue: without override:true, the
~/.archon/.env load doesn't win over shell-inherited env vars. If the
user's shell profile exports PORT=9999 and ~/.archon/.env has PORT=3000,
the user expects Archon to use 3000.

stripCwdEnv() handles CWD .env files (untrusted). override:true handles
shell-inherited vars (trusted but less specific than ~/.archon/.env).
Different concerns, both needed.

Also adds 6 integration tests covering the full entry-point flow:
1. Global auth user with ANTHROPIC_API_KEY in CWD .env — stripped
2. OAuth token in archon env + random key in CWD — CWD stripped, archon kept
3. General leak test — nothing from CWD reaches subprocess
4. Same key in both CWD and archon — archon value wins
5. CLAUDECODE markers stripped even when not from CWD .env
6. CLAUDE_CODE_OAUTH_TOKEN survives marker strip

* test: add DATABASE_URL leak scenarios to env integration tests

* fix: move CLAUDECODE warning into stripCwdEnv, remove dead useGlobalAuth logic

Review findings addressed:

1. CLAUDECODE warning was dead code — the boot import deleted CLAUDECODE
   from process.env before the warning check in cli.ts/server/index.ts
   could fire. Moved the warning into stripCwdEnv() itself, emitted
   BEFORE the deletion. Removed duplicate warning code from both entry
   points.

2. useGlobalAuth token stripping removed (intentional, not regression) —
   the old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY when
   useGlobalAuth=true. Per design discussion: the user controls
   ~/.archon/.env and all keys they set are intentional. If they want
   global auth, they just don't set tokens. Simplified buildSubprocessEnv
   to log auth mode for diagnostics only, no filtering.

3. Docs "no override needed" corrected — cli.md and configuration.md
   now reflect the actual code (override: true).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Rasmus Widing <rasmus.widing@gmail.com>
@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented Apr 13, 2026

Closing — this PR bundles two unrelated fixes:

  1. Nested Claude Code marker stripping — superseded by security: prevent target repo .env from leaking into subprocesses (#1135) #1169 (merged into dev). stripCwdEnv() handles marker cleanup at boot, and executableArgs: ['--no-env-file'] prevents re-injection in the Claude subprocess. E2E verified from within a Claude Code session.

  2. Telegraf readonly property crash on Bun 1.3+ — still a valid bug, tracked separately in Telegram adapter crashes on Bun: TypeError readonly property in Telegraf #1042 with an alternative approach in fix: replace Telegraf with grammY to fix Bun TypeError crash #1066 (replace Telegraf with grammY).

Thanks for identifying both issues — the marker stripping analysis directly informed the structural fix that landed.

@Wirasm Wirasm closed this Apr 13, 2026
prospapledge88 pushed a commit to prospapledge88/Archon that referenced this pull request Apr 14, 2026
…t timeout (coleam00#1067, coleam00#1030, coleam00#1098, coleam00#1070)

* fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout (coleam00#1067)

Three bugs fixed: (1) Bun auto-loads CWD .env files before user code, leaking
non-overlapping keys into the Archon process — new stripCwdEnv() boot import
removes them before any module reads env. (2) archon serve hardcoded
skipPlatformAdapters:true, preventing Slack/Telegram/Discord from starting.
(3) Claude SDK query had no first-event timeout, causing silent 30-min hangs
when the subprocess wedges — new withFirstMessageTimeout wrapper races the
first event against a configurable deadline (default 60s).

Changes:
- Add @archon/paths/strip-cwd-env and strip-cwd-env-boot modules
- Import boot module as first import in CLI entry point
- Remove skipPlatformAdapters: true from serve.ts
- Add withFirstMessageTimeout + diagnostics to ClaudeClient
- Add CLAUDECODE=1 nested-session warning to CLI
- Add 9 unit tests (6 strip-cwd-env + 3 timeout)

Fixes coleam00#1067

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings for PR coleam00#1092

Fixed:
- Clear setTimeout timer in withFirstMessageTimeout finally block (HIGH-1)
- Add strip-cwd-env-boot to server/src/index.ts for direct dev:server path (MEDIUM-1)
- Warn to stderr on non-ENOENT errors in stripCwdEnv (MEDIUM-2)
- Update stale configuration.md docs for new env-loading mechanism (HIGH-2)
- Add ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING env vars to docs
- Add nested Claude Code hang troubleshooting entry
- Fix boot module JSDoc: "CLI and server" → "CLI" only
- Fix stripCwdEnv JSDoc: remove stale "override: true" reference
- Update .claude/rules/cli.md startup behavior section
- Update CLAUDE.md @archon/paths description with new exports

Tests added:
- Assert controller.signal.aborted on timeout
- Handle generator that completes immediately without yielding
- Strip distinct keys from different .env files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* simplify: replace string sentinel with typed error class in withFirstMessageTimeout

Replace the '__timeout__' string sentinel used to identify timeout rejections
with a dedicated FirstEventTimeoutError class. instanceof checks are more
explicit and robust than string comparison on error messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review findings — dotenv version, docs, server warning, marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (coleam00#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate coleam00#1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per coleam00#1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).

* refactor: remove SUBPROCESS_ENV_ALLOWLIST — trust user config, strip only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see coleam00#1097)
- It attracted a constant stream of PRs adding keys (coleam00#1060, coleam00#1093, coleam00#1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.

* fix: restore override:true for archon env, add integration tests

The integration tests caught a real issue: without override:true, the
~/.archon/.env load doesn't win over shell-inherited env vars. If the
user's shell profile exports PORT=9999 and ~/.archon/.env has PORT=3000,
the user expects Archon to use 3000.

stripCwdEnv() handles CWD .env files (untrusted). override:true handles
shell-inherited vars (trusted but less specific than ~/.archon/.env).
Different concerns, both needed.

Also adds 6 integration tests covering the full entry-point flow:
1. Global auth user with ANTHROPIC_API_KEY in CWD .env — stripped
2. OAuth token in archon env + random key in CWD — CWD stripped, archon kept
3. General leak test — nothing from CWD reaches subprocess
4. Same key in both CWD and archon — archon value wins
5. CLAUDECODE markers stripped even when not from CWD .env
6. CLAUDE_CODE_OAUTH_TOKEN survives marker strip

* test: add DATABASE_URL leak scenarios to env integration tests

* fix: move CLAUDECODE warning into stripCwdEnv, remove dead useGlobalAuth logic

Review findings addressed:

1. CLAUDECODE warning was dead code — the boot import deleted CLAUDECODE
   from process.env before the warning check in cli.ts/server/index.ts
   could fire. Moved the warning into stripCwdEnv() itself, emitted
   BEFORE the deletion. Removed duplicate warning code from both entry
   points.

2. useGlobalAuth token stripping removed (intentional, not regression) —
   the old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY when
   useGlobalAuth=true. Per design discussion: the user controls
   ~/.archon/.env and all keys they set are intentional. If they want
   global auth, they just don't set tokens. Simplified buildSubprocessEnv
   to log auth mode for diagnostics only, no filtering.

3. Docs "no override needed" corrected — cli.md and configuration.md
   now reflect the actual code (override: true).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Rasmus Widing <rasmus.widing@gmail.com>
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…t timeout (coleam00#1067, coleam00#1030, coleam00#1098, coleam00#1070)

* fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout (coleam00#1067)

Three bugs fixed: (1) Bun auto-loads CWD .env files before user code, leaking
non-overlapping keys into the Archon process — new stripCwdEnv() boot import
removes them before any module reads env. (2) archon serve hardcoded
skipPlatformAdapters:true, preventing Slack/Telegram/Discord from starting.
(3) Claude SDK query had no first-event timeout, causing silent 30-min hangs
when the subprocess wedges — new withFirstMessageTimeout wrapper races the
first event against a configurable deadline (default 60s).

Changes:
- Add @archon/paths/strip-cwd-env and strip-cwd-env-boot modules
- Import boot module as first import in CLI entry point
- Remove skipPlatformAdapters: true from serve.ts
- Add withFirstMessageTimeout + diagnostics to ClaudeClient
- Add CLAUDECODE=1 nested-session warning to CLI
- Add 9 unit tests (6 strip-cwd-env + 3 timeout)

Fixes coleam00#1067

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings for PR coleam00#1092

Fixed:
- Clear setTimeout timer in withFirstMessageTimeout finally block (HIGH-1)
- Add strip-cwd-env-boot to server/src/index.ts for direct dev:server path (MEDIUM-1)
- Warn to stderr on non-ENOENT errors in stripCwdEnv (MEDIUM-2)
- Update stale configuration.md docs for new env-loading mechanism (HIGH-2)
- Add ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING env vars to docs
- Add nested Claude Code hang troubleshooting entry
- Fix boot module JSDoc: "CLI and server" → "CLI" only
- Fix stripCwdEnv JSDoc: remove stale "override: true" reference
- Update .claude/rules/cli.md startup behavior section
- Update CLAUDE.md @archon/paths description with new exports

Tests added:
- Assert controller.signal.aborted on timeout
- Handle generator that completes immediately without yielding
- Strip distinct keys from different .env files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* simplify: replace string sentinel with typed error class in withFirstMessageTimeout

Replace the '__timeout__' string sentinel used to identify timeout rejections
with a dedicated FirstEventTimeoutError class. instanceof checks are more
explicit and robust than string comparison on error messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review findings — dotenv version, docs, server warning, marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (coleam00#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate coleam00#1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per coleam00#1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).

* refactor: remove SUBPROCESS_ENV_ALLOWLIST — trust user config, strip only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see coleam00#1097)
- It attracted a constant stream of PRs adding keys (coleam00#1060, coleam00#1093, coleam00#1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.

* fix: restore override:true for archon env, add integration tests

The integration tests caught a real issue: without override:true, the
~/.archon/.env load doesn't win over shell-inherited env vars. If the
user's shell profile exports PORT=9999 and ~/.archon/.env has PORT=3000,
the user expects Archon to use 3000.

stripCwdEnv() handles CWD .env files (untrusted). override:true handles
shell-inherited vars (trusted but less specific than ~/.archon/.env).
Different concerns, both needed.

Also adds 6 integration tests covering the full entry-point flow:
1. Global auth user with ANTHROPIC_API_KEY in CWD .env — stripped
2. OAuth token in archon env + random key in CWD — CWD stripped, archon kept
3. General leak test — nothing from CWD reaches subprocess
4. Same key in both CWD and archon — archon value wins
5. CLAUDECODE markers stripped even when not from CWD .env
6. CLAUDE_CODE_OAUTH_TOKEN survives marker strip

* test: add DATABASE_URL leak scenarios to env integration tests

* fix: move CLAUDECODE warning into stripCwdEnv, remove dead useGlobalAuth logic

Review findings addressed:

1. CLAUDECODE warning was dead code — the boot import deleted CLAUDECODE
   from process.env before the warning check in cli.ts/server/index.ts
   could fire. Moved the warning into stripCwdEnv() itself, emitted
   BEFORE the deletion. Removed duplicate warning code from both entry
   points.

2. useGlobalAuth token stripping removed (intentional, not regression) —
   the old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY when
   useGlobalAuth=true. Per design discussion: the user controls
   ~/.archon/.env and all keys they set are intentional. If they want
   global auth, they just don't set tokens. Simplified buildSubprocessEnv
   to log auth mode for diagnostics only, no filtering.

3. Docs "no override needed" corrected — cli.md and configuration.md
   now reflect the actual code (override: true).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Rasmus Widing <rasmus.widing@gmail.com>
joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…t timeout (coleam00#1067, coleam00#1030, coleam00#1098, coleam00#1070)

* fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout (coleam00#1067)

Three bugs fixed: (1) Bun auto-loads CWD .env files before user code, leaking
non-overlapping keys into the Archon process — new stripCwdEnv() boot import
removes them before any module reads env. (2) archon serve hardcoded
skipPlatformAdapters:true, preventing Slack/Telegram/Discord from starting.
(3) Claude SDK query had no first-event timeout, causing silent 30-min hangs
when the subprocess wedges — new withFirstMessageTimeout wrapper races the
first event against a configurable deadline (default 60s).

Changes:
- Add @archon/paths/strip-cwd-env and strip-cwd-env-boot modules
- Import boot module as first import in CLI entry point
- Remove skipPlatformAdapters: true from serve.ts
- Add withFirstMessageTimeout + diagnostics to ClaudeClient
- Add CLAUDECODE=1 nested-session warning to CLI
- Add 9 unit tests (6 strip-cwd-env + 3 timeout)

Fixes coleam00#1067

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings for PR coleam00#1092

Fixed:
- Clear setTimeout timer in withFirstMessageTimeout finally block (HIGH-1)
- Add strip-cwd-env-boot to server/src/index.ts for direct dev:server path (MEDIUM-1)
- Warn to stderr on non-ENOENT errors in stripCwdEnv (MEDIUM-2)
- Update stale configuration.md docs for new env-loading mechanism (HIGH-2)
- Add ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING env vars to docs
- Add nested Claude Code hang troubleshooting entry
- Fix boot module JSDoc: "CLI and server" → "CLI" only
- Fix stripCwdEnv JSDoc: remove stale "override: true" reference
- Update .claude/rules/cli.md startup behavior section
- Update CLAUDE.md @archon/paths description with new exports

Tests added:
- Assert controller.signal.aborted on timeout
- Handle generator that completes immediately without yielding
- Strip distinct keys from different .env files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* simplify: replace string sentinel with typed error class in withFirstMessageTimeout

Replace the '__timeout__' string sentinel used to identify timeout rejections
with a dedicated FirstEventTimeoutError class. instanceof checks are more
explicit and robust than string comparison on error messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review findings — dotenv version, docs, server warning, marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (coleam00#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate coleam00#1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per coleam00#1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).

* refactor: remove SUBPROCESS_ENV_ALLOWLIST — trust user config, strip only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see coleam00#1097)
- It attracted a constant stream of PRs adding keys (coleam00#1060, coleam00#1093, coleam00#1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.

* fix: restore override:true for archon env, add integration tests

The integration tests caught a real issue: without override:true, the
~/.archon/.env load doesn't win over shell-inherited env vars. If the
user's shell profile exports PORT=9999 and ~/.archon/.env has PORT=3000,
the user expects Archon to use 3000.

stripCwdEnv() handles CWD .env files (untrusted). override:true handles
shell-inherited vars (trusted but less specific than ~/.archon/.env).
Different concerns, both needed.

Also adds 6 integration tests covering the full entry-point flow:
1. Global auth user with ANTHROPIC_API_KEY in CWD .env — stripped
2. OAuth token in archon env + random key in CWD — CWD stripped, archon kept
3. General leak test — nothing from CWD reaches subprocess
4. Same key in both CWD and archon — archon value wins
5. CLAUDECODE markers stripped even when not from CWD .env
6. CLAUDE_CODE_OAUTH_TOKEN survives marker strip

* test: add DATABASE_URL leak scenarios to env integration tests

* fix: move CLAUDECODE warning into stripCwdEnv, remove dead useGlobalAuth logic

Review findings addressed:

1. CLAUDECODE warning was dead code — the boot import deleted CLAUDECODE
   from process.env before the warning check in cli.ts/server/index.ts
   could fire. Moved the warning into stripCwdEnv() itself, emitted
   BEFORE the deletion. Removed duplicate warning code from both entry
   points.

2. useGlobalAuth token stripping removed (intentional, not regression) —
   the old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY when
   useGlobalAuth=true. Per design discussion: the user controls
   ~/.archon/.env and all keys they set are intentional. If they want
   global auth, they just don't set tokens. Simplified buildSubprocessEnv
   to log auth mode for diagnostics only, no filtering.

3. Docs "no override needed" corrected — cli.md and configuration.md
   now reflect the actual code (override: true).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Rasmus Widing <rasmus.widing@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants